This report was generated on 2017-10-05 at 23:08:21

Zillow Prize Data Analysis Project

This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python (a modified version of this docker image will be made available as part of my project to ensure reproducibility). For example, here are several helpful packages to load.

Import Libraries and Data:

Input data files are available in the "../input/" directory.

Any results I write to the current directory are saved as output.
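
A minimal sketch of the loading step. The file names properties_2016.csv and train_2016_v2.csv and the helper load_zillow are assumptions for illustration; the report does not name the files. Passing low_memory=False makes pandas infer each column's type from the whole file, which is the option suggested by the DtypeWarning that appears later in this report.

```python
import os

import pandas as pd


def load_zillow(input_dir="../input",
                properties_file="properties_2016.csv",  # assumed file name
                train_file="train_2016_v2.csv"):        # assumed file name
    """Read the property table and the transaction table from input_dir."""
    # low_memory=False reads each column in one pass, avoiding mixed-type columns.
    properties = pd.read_csv(os.path.join(input_dir, properties_file),
                             low_memory=False)
    # Parse the transaction date as a datetime rather than a string.
    train = pd.read_csv(os.path.join(input_dir, train_file),
                        parse_dates=["transactiondate"])
    return properties, train
```

Calling load_zillow() with no arguments reads from the "../input/" directory described above.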

parcelid airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr decktypeid ... numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 10754147 NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN ... NaN NaN NaN 9.0 2015.0 9.0 NaN NaN NaN NaN
1 10759547 NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN ... NaN NaN NaN 27516.0 2015.0 27516.0 NaN NaN NaN NaN
2 10843547 NaN NaN NaN 0.0 0.0 NaN NaN NaN NaN ... NaN NaN 650756.0 1413387.0 2015.0 762631.0 20800.37 NaN NaN NaN
3 10859147 NaN NaN NaN 0.0 0.0 3 7 NaN NaN ... 1.0 NaN 571346.0 1156834.0 2015.0 585488.0 14557.57 NaN NaN NaN
4 10879947 NaN NaN NaN 0.0 0.0 4 NaN NaN NaN ... NaN NaN 193796.0 433491.0 2015.0 239695.0 5725.17 NaN NaN NaN

5 rows × 58 columns

parcelid logerror transactiondate
0 11016594 0.0276 2016-01-01
1 14366692 -0.1684 2016-01-01
2 12098116 -0.0040 2016-01-01
3 12643413 0.0218 2016-01-02
4 14432541 -0.0050 2016-01-02
parcelid airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr decktypeid ... structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock logerror transactiondate
0 17073783 NaN NaN NaN 2.5 3.0 NaN NaN 2.5 NaN ... 115087.0 191811.0 2015.0 76724.0 2015.06 NaN NaN 61110022003007 0.0953 2016-01-27
1 17088994 NaN NaN NaN 1.0 2.0 NaN NaN 1.0 NaN ... 143809.0 239679.0 2015.0 95870.0 2581.30 NaN NaN 61110015031002 0.0198 2016-03-30
2 17100444 NaN NaN NaN 2.0 3.0 NaN NaN 2.0 NaN ... 33619.0 47853.0 2015.0 14234.0 591.64 NaN NaN 61110007011007 0.0060 2016-05-27
3 17102429 NaN NaN NaN 1.5 2.0 NaN NaN 1.5 NaN ... 45609.0 62914.0 2015.0 17305.0 682.78 NaN NaN 61110008002013 -0.0566 2016-06-07
4 17109604 NaN NaN NaN 2.5 4.0 NaN NaN 2.5 NaN ... 277000.0 554000.0 2015.0 277000.0 5886.92 NaN NaN 61110014021007 0.0573 2016-08-08

5 rows × 60 columns
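
The 60-column table above is consistent with joining the 3-column transaction table to the 58-column property table on parcelid (3 + 58, minus the one shared key). A minimal sketch with toy data follows; the join type used in the original notebook is not shown, so how="left" is an assumption.

```python
import pandas as pd

# Toy stand-ins for the transaction and property tables.
train = pd.DataFrame({
    "parcelid": [11016594, 14366692],
    "logerror": [0.0276, -0.1684],
    "transactiondate": pd.to_datetime(["2016-01-01", "2016-01-01"]),
})
properties = pd.DataFrame({
    "parcelid": [11016594, 14366692, 10754147],
    "bathroomcnt": [2.0, 3.5, 0.0],
    "bedroomcnt": [3.0, 4.0, 0.0],
})

# Keep every transaction and attach the matching property attributes.
merged = train.merge(properties, on="parcelid", how="left")
print(merged.shape)
```

With a left join, properties that were never sold (such as the third toy row) do not appear in the merged table.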

Large Negative Error     18442
Small Error              18432
Medium Negative Error    17973
Large Positive Error     17947
Medium Positive Error    17481
Name: logerror_bin, dtype: int64
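
The bin counts above are roughly equal, which is what quantile-based binning produces. The sketch below works under that assumption, applying pd.qcut with five bins to synthetic log-errors; the method and bin edges used in the original notebook are not shown.

```python
import numpy as np
import pandas as pd

# Synthetic log-errors, for illustration only.
rng = np.random.RandomState(0)
logerror = pd.Series(rng.normal(loc=0.01, scale=0.16, size=1000), name="logerror")

labels = ["Large Negative Error", "Medium Negative Error", "Small Error",
          "Medium Positive Error", "Large Positive Error"]

# qcut places the bin edges at the 20th, 40th, 60th and 80th percentiles,
# so each bin receives about the same number of rows.
logerror_bin = pd.qcut(logerror, q=5, labels=labels)
print(logerror_bin.value_counts())
```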

Supplemental figures

(90275, 3)

Distribution of Target Variable:

/Users/marskar/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  This is separate from the ipykernel package so we can avoid doing imports until
/Users/marskar/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py:179: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

Log-errors are close to normally distributed around a mean of 0, with a slight positive skew. There are also a considerable number of outliers; I will explore whether removing these improves model performance.
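
One way to quantify the skew and to try the outlier removal mentioned above, sketched on synthetic data. The 1st and 99th percentile cutoffs are an illustrative assumption, not thresholds taken from this report.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed log-errors, for illustration only.
rng = np.random.RandomState(42)
train = pd.DataFrame({"logerror": rng.standard_t(df=3, size=5000) * 0.05})

# Sample skewness: positive values indicate a longer right tail.
print("skewness:", train["logerror"].skew())

# Keep only rows between the assumed percentile cutoffs.
lower, upper = train["logerror"].quantile([0.01, 0.99])
inliers = train[train["logerror"].between(lower, upper)]
print(len(train), "->", len(inliers))
```

Whether trimming helps would have to be checked by comparing validation error with and without the filter.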

Proportion of Missing Values in Each Column:

/Users/marskar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2728: DtypeWarning: Columns (22,32,34,49,55) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
(2985217, 58)
1    90026
2      123
3        1
Name: parcelid, dtype: int64
<matplotlib.figure.Figure at 0x1101805c0>
Exception ignored in: <bound method DMatrix.__del__ of <xgboost.core.DMatrix object at 0x10fa6cc50>>
Traceback (most recent call last):
  File "/Users/marskar/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 324, in __del__
    _check_call(_LIB.XGDMatrixFree(self.handle))
AttributeError: 'DMatrix' object has no attribute 'handle'
parcelid logerror transactiondate airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid ... numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 11016594 0.0276 2016-01-01 1.0 NaN NaN 2.0 3.0 NaN 4.0 ... NaN NaN 122754.0 360170.0 2015.0 237416.0 6735.88 NaN NaN 6.037107e+13
1 14366692 -0.1684 2016-01-01 NaN NaN NaN 3.5 4.0 NaN NaN ... NaN NaN 346458.0 585529.0 2015.0 239071.0 10153.02 NaN NaN NaN
2 12098116 -0.0040 2016-01-01 1.0 NaN NaN 3.0 2.0 NaN 4.0 ... NaN NaN 61994.0 119906.0 2015.0 57912.0 11484.48 NaN NaN 6.037464e+13
3 12643413 0.0218 2016-01-02 1.0 NaN NaN 2.0 2.0 NaN 4.0 ... NaN NaN 171518.0 244880.0 2015.0 73362.0 3048.74 NaN NaN 6.037296e+13
4 14432541 -0.0050 2016-01-02 NaN NaN NaN 2.5 4.0 NaN NaN ... 2.0 NaN 169574.0 434551.0 2015.0 264977.0 5488.96 NaN NaN 6.059042e+13

5 rows × 60 columns

Count Column Type
0 parcelid int64
1 logerror float64
2 transactiondate datetime64[ns]
3 airconditioningtypeid float64
4 architecturalstyletypeid float64
5 basementsqft float64
6 bathroomcnt float64
7 bedroomcnt float64
8 buildingclasstypeid float64
9 buildingqualitytypeid float64
10 calculatedbathnbr float64
11 decktypeid float64
12 finishedfloor1squarefeet float64
13 calculatedfinishedsquarefeet float64
14 finishedsquarefeet12 float64
15 finishedsquarefeet13 float64
16 finishedsquarefeet15 float64
17 finishedsquarefeet50 float64
18 finishedsquarefeet6 float64
19 fips float64
20 fireplacecnt float64
21 fullbathcnt float64
22 garagecarcnt float64
23 garagetotalsqft float64
24 hashottuborspa object
25 heatingorsystemtypeid float64
26 latitude float64
27 longitude float64
28 lotsizesquarefeet float64
29 poolcnt float64
30 poolsizesum float64
31 pooltypeid10 float64
32 pooltypeid2 float64
33 pooltypeid7 float64
34 propertycountylandusecode object
35 propertylandusetypeid float64
36 propertyzoningdesc object
37 rawcensustractandblock float64
38 regionidcity float64
39 regionidcounty float64
40 regionidneighborhood float64
41 regionidzip float64
42 roomcnt float64
43 storytypeid float64
44 threequarterbathnbr float64
45 typeconstructiontypeid float64
46 unitcnt float64
47 yardbuildingsqft17 float64
48 yardbuildingsqft26 float64
49 yearbuilt float64
50 numberofstories float64
51 fireplaceflag object
52 structuretaxvaluedollarcnt float64
53 taxvaluedollarcnt float64
54 assessmentyear float64
55 landtaxvaluedollarcnt float64
56 taxamount float64
57 taxdelinquencyflag object
58 taxdelinquencyyear float64
59 censustractandblock float64
Column Type Count
0 int64 1
1 float64 53
2 datetime64[ns] 1
3 object 5
column_name missing_count missing_ratio
5 basementsqft 90232 0.999524
8 buildingclasstypeid 90259 0.999823
15 finishedsquarefeet13 90242 0.999634
43 storytypeid 90232 0.999524
/Users/marskar/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3162: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/Users/marskar/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3163: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]
assessmentyear 1
storytypeid 1
pooltypeid2 1
pooltypeid7 1
pooltypeid10 1
poolcnt 1
decktypeid 1
buildingclasstypeid 1
col_labels corr_values
49 taxamount -0.014768
21 heatingorsystemtypeid -0.013732
43 yearbuilt 0.021171
4 bedroomcnt 0.032035
18 fullbathcnt 0.034267
7 calculatedbathnbr 0.036019
3 bathroomcnt 0.036862
10 calculatedfinishedsquarefeet 0.047659
11 finishedsquarefeet12 0.048611
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    393                             if callable(meth):
    394                                 return meth(obj, self, cycle)
--> 395             return _default_pprint(obj, self, cycle)
    396         finally:
    397             self.end_group()

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    508     if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
    509         # A user-provided repr. Find newlines and replace them with p.break_()
--> 510         _repr_pprint(obj, p, cycle)
    511         return
    512     p.begin_group(1, '<')

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    699     """A pprint that just redirects to the normal repr function."""
    700     # Find newlines and replace them with p.break_()
--> 701     output = repr(obj)
    702     for idx,output_line in enumerate(output.splitlines()):
    703         if idx:

~/anaconda3/lib/python3.6/site-packages/ggplot/ggplot.py in __repr__(self)
    114 
    115     def __repr__(self):
--> 116         self.make()
    117         # this is nice for dev but not the best for "real"
    118         if os.environ.get("GGPLOT_DEV"):

~/anaconda3/lib/python3.6/site-packages/ggplot/ggplot.py in make(self)
    634                         if kwargs==False:
    635                             continue
--> 636                         layer.plot(ax, facetgroup, self._aes, **kwargs)
    637 
    638             self.apply_limits()

~/anaconda3/lib/python3.6/site-packages/ggplot/stats/stat_smooth.py in plot(self, ax, data, _aes)
     75 
     76         smoothed_data = pd.DataFrame(dict(x=x, y=y, y1=y1, y2=y2))
---> 77         smoothed_data = smoothed_data.sort('x')
     78 
     79         params = self._get_plot_args(data, _aes)

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   3079             if name in self._info_axis:
   3080                 return self[name]
-> 3081             return object.__getattribute__(self, name)
   3082 
   3083     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'sort'
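
The traceback above ends in DataFrame.sort, a method that was deprecated in pandas 0.17 and removed in pandas 0.20; the installed ggplot version still calls it. The replacement is sort_values, illustrated here on a small made-up frame. How best to patch or replace the ggplot package is left open.

```python
import pandas as pd

# Small made-up frame standing in for ggplot's smoothed_data.
smoothed_data = pd.DataFrame({"x": [3, 1, 2], "y": [0.3, 0.1, 0.2]})

# Old call that fails on recent pandas: smoothed_data.sort('x')
smoothed_data = smoothed_data.sort_values("x")
print(smoothed_data)
```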
<ggplot: (282537350)>
<ggplot: (-9223372036568651127)>
Put a bird on it!
<ggplot: (-9223372036568636612)>

Analyse the Dimensions of our Datasets

Training Size:(90275, 61)
Property Size:(2985217, 58)
parcelid                              0
airconditioningtypeid           2173698
architecturalstyletypeid        2979156
basementsqft                    2983589
bathroomcnt                       11462
bedroomcnt                        11450
buildingclasstypeid             2972588
buildingqualitytypeid           1046729
calculatedbathnbr                128912
decktypeid                      2968121
finishedfloor1squarefeet        2782500
calculatedfinishedsquarefeet      55565
finishedsquarefeet12             276033
finishedsquarefeet13            2977545
finishedsquarefeet15            2794419
finishedsquarefeet50            2782500
finishedsquarefeet6             2963216
fips                              11437
fireplacecnt                    2672580
fullbathcnt                      128912
garagecarcnt                    2101950
garagetotalsqft                 2101950
hashottuborspa                  2916203
heatingorsystemtypeid           1178816
latitude                          11437
longitude                         11437
lotsizesquarefeet                276099
poolcnt                         2467683
poolsizesum                     2957257
pooltypeid10                    2948278
pooltypeid2                     2953142
pooltypeid7                     2499758
propertycountylandusecode         12277
propertylandusetypeid             11437
propertyzoningdesc              1006588
rawcensustractandblock            11437
regionidcity                      62845
regionidcounty                    11437
regionidneighborhood            1828815
regionidzip                       13980
roomcnt                           11475
storytypeid                     2983593
threequarterbathnbr             2673586
typeconstructiontypeid          2978470
unitcnt                         1007727
yardbuildingsqft17              2904862
yardbuildingsqft26              2982570
yearbuilt                         59928
numberofstories                 2303148
fireplaceflag                   2980054
structuretaxvaluedollarcnt        54982
taxvaluedollarcnt                 42550
assessmentyear                    11439
landtaxvaluedollarcnt             67733
taxamount                         31250
taxdelinquencyflag              2928755
taxdelinquencyyear              2928753
censustractandblock               75126
dtype: int64

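Several columns have a very high proportion of missing values; for example, basementsqft, storytypeid and yardbuildingsqft26 are each missing in more than 99.9% of the 2,985,217 properties. It may be worth analysing these more closely.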
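
A minimal sketch of how the missing counts above can be turned into per-column ratios, on a toy table. The output column names follow the column_name / missing_count / missing_ratio table shown earlier; the 50% threshold is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Toy property table, for illustration only.
properties = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "basementsqft": [np.nan, np.nan, np.nan, 600.0],
    "bathroomcnt": [2.0, np.nan, 1.0, 3.0],
})

# Count missing values per column, then divide by the number of rows.
missing = properties.isnull().sum().reset_index()
missing.columns = ["column_name", "missing_count"]
missing["missing_ratio"] = missing["missing_count"] / len(properties)

# Columns above the assumed threshold are candidates for closer inspection.
mostly_missing = missing[missing["missing_ratio"] > 0.5]
print(mostly_missing)
```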

Monthly Effects on Target Variable

      parcelid  logerror transactiondate  transaction_month
0     11016594    0.0276      2016-01-01                  1
4392  12379107    0.0276      2016-01-22                  1
4391  12259947    0.0010      2016-01-22                  1
4390  17204079    0.0871      2016-01-22                  1
4389  12492292   -0.0212      2016-01-22                  1

For submission we are required to predict values for October, November and December. The differing distributions of the target variable over these months indicate that it may be useful to create an additional 'transaction_month' feature, as shown above. Let's take a closer look at the distribution across only October, November and December.
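
The 'transaction_month' feature can be derived from the transaction date with the pandas datetime accessor. A small sketch; the first two rows come from the table above, and the third (an October date) is made up to show a non-January value.

```python
import pandas as pd

train = pd.DataFrame({
    "parcelid": [11016594, 12379107, 99999999],  # last row is made up
    "logerror": [0.0276, 0.0276, 0.0100],
    "transactiondate": ["2016-01-01", "2016-01-22", "2016-10-05"],
})

# Convert to datetime first, then extract the calendar month (1 to 12).
train["transactiondate"] = pd.to_datetime(train["transactiondate"])
train["transaction_month"] = train["transactiondate"].dt.month
print(train)
```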

Proportion of Transactions in Each Month

    transaction_month  month
1            0.072623      1
2            0.070152      2
3            0.095840      3
4            0.103140      4
5            0.110341      5
6            0.120986      6
7            0.110186      7
8            0.116045      8
9            0.106065      9
10           0.055132     10
11           0.020227     11
12           0.019263     12

This dataset contains more transactions occurring in the spring and summer months, although it must be noted that some transactions from October, November and December have been removed to form the competition's test set (thanks to nonrandom for pointing this out).
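
The monthly proportions can be computed with value_counts(normalize=True). A sketch on a handful of made-up month values:

```python
import pandas as pd

# Made-up month values, for illustration only.
transaction_month = pd.Series([1, 1, 3, 6, 6, 6, 10, 12], name="transaction_month")

# normalize=True returns shares instead of counts; sort_index orders by month.
proportions = transaction_month.value_counts(normalize=True).sort_index()
print(proportions)
```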

Feature Importance

   parcelid  logerror transactiondate  transaction_month  \
0  11016594    0.0276      2016-01-01                  1   
1  12379107    0.0276      2016-01-22                  1   
2  12259947    0.0010      2016-01-22                  1   
3  17204079    0.0871      2016-01-22                  1   
4  12492292   -0.0212      2016-01-22                  1   

   airconditioningtypeid  architecturalstyletypeid  basementsqft  bathroomcnt  \
0                    1.0                      -1.0          -1.0          2.0   
1                   -1.0                      -1.0          -1.0          1.0   
2                   -1.0                      -1.0          -1.0          1.0   
3                   -1.0                      -1.0          -1.0          4.0   
4                    1.0                      -1.0          -1.0          1.0   

   bedroomcnt  buildingclasstypeid         ...           numberofstories  \
0         3.0                 -1.0         ...                      -1.0   
1         2.0                 -1.0         ...                      -1.0   
2         3.0                 -1.0         ...                      -1.0   
3         4.0                 -1.0         ...                       2.0   
4         3.0                 -1.0         ...                      -1.0   

   fireplaceflag  structuretaxvaluedollarcnt  taxvaluedollarcnt  \
0             -1                    122754.0           360170.0   
1             -1                     37095.0           185481.0   
2             -1                    137012.0           240371.0   
3             -1                    373100.0           746200.0   
4             -1                     40729.0            61709.0   

   assessmentyear  landtaxvaluedollarcnt  taxamount  taxdelinquencyflag  \
0          2015.0               237416.0    6735.88                  -1   
1          2015.0               148386.0    3051.73                  -1   
2          2015.0               103359.0    5707.91                  -1   
3          2015.0               373100.0    8576.10                  -1   
4          2015.0                20980.0    1056.92                  -1   

   taxdelinquencyyear  censustractandblock  
0                -1.0         6.037107e+13  
1                -1.0         6.037532e+13  
2                -1.0         6.037541e+13  
3                -1.0         6.111008e+13  
4                -1.0         6.037571e+13  

[5 rows x 61 columns]
---------------------
(90275, 61)
   transaction_month  airconditioningtypeid  architecturalstyletypeid  \
0                  1                    1.0                      -1.0   
1                  1                   -1.0                      -1.0   
2                  1                   -1.0                      -1.0   
3                  1                   -1.0                      -1.0   
4                  1                    1.0                      -1.0   

   basementsqft  bathroomcnt  bedroomcnt  buildingclasstypeid  \
0          -1.0          2.0         3.0                 -1.0   
1          -1.0          1.0         2.0                 -1.0   
2          -1.0          1.0         3.0                 -1.0   
3          -1.0          4.0         4.0                 -1.0   
4          -1.0          1.0         3.0                 -1.0   

   buildingqualitytypeid  calculatedbathnbr  decktypeid         ...           \
0                    4.0                2.0        -1.0         ...            
1                    7.0                1.0        -1.0         ...            
2                    7.0                1.0        -1.0         ...            
3                   -1.0                4.0        -1.0         ...            
4                    7.0                1.0        -1.0         ...            

   numberofstories  fireplaceflag  structuretaxvaluedollarcnt  \
0             -1.0              0                    122754.0   
1             -1.0              0                     37095.0   
2             -1.0              0                    137012.0   
3              2.0              0                    373100.0   
4             -1.0              0                     40729.0   

   taxvaluedollarcnt  assessmentyear  landtaxvaluedollarcnt  taxamount  \
0           360170.0          2015.0               237416.0    6735.88   
1           185481.0          2015.0               148386.0    3051.73   
2           240371.0          2015.0               103359.0    5707.91   
3           746200.0          2015.0               373100.0    8576.10   
4            61709.0          2015.0                20980.0    1056.92   

   taxdelinquencyflag  taxdelinquencyyear  censustractandblock  
0                   0                -1.0         6.037107e+13  
1                   0                -1.0         6.037532e+13  
2                   0                -1.0         6.037541e+13  
3                   0                -1.0         6.111008e+13  
4                   0                -1.0         6.037571e+13  

[5 rows x 58 columns]
------------
0    0.0276
1    0.0276
2    0.0010
3    0.0871
4   -0.0212
Name: logerror, dtype: float64
                   features  importance
0         transaction_month    0.039308
1     airconditioningtypeid    0.006998
2  architecturalstyletypeid    0.000359
3              basementsqft    0.000310
4               bathroomcnt    0.007828
------------
                      features  importance
50  structuretaxvaluedollarcnt    0.083723
25                   longitude    0.077608
54                   taxamount    0.075427
24                    latitude    0.074305
26           lotsizesquarefeet    0.071182

Here we see that the most important features for predicting the log-error involve taxes (structuretaxvaluedollarcnt, taxamount) and the geographical location of the property (longitude, latitude). Notably, the 'transaction_month' feature that was engineered earlier was the 12th most important feature.
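
A sketch of how such a ranking could be produced. The -1 placeholders in the tables above suggest that missing values were filled with -1 before fitting; the choice of scikit-learn's RandomForestRegressor and its settings are assumptions, since this report does not name the model. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; only taxamount actually drives the target here.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "taxamount": rng.uniform(500, 20000, size=300),
    "latitude": rng.uniform(33.3e6, 34.8e6, size=300),
    "basementsqft": np.where(rng.rand(300) < 0.99, np.nan, 500.0),
})
y = 1e-5 * X["taxamount"] + rng.normal(scale=0.02, size=300)

# Tree models cannot take NaN directly in older scikit-learn, so use a sentinel.
X = X.fillna(-1)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# feature_importances_ sums to 1 across all features.
importance = pd.DataFrame({"features": X.columns,
                           "importance": model.feature_importances_})
importance = importance.sort_values("importance", ascending=False)
print(importance)
```

Impurity-based importances like these can favour high-cardinality numeric features, so the ranking is best read as a rough guide.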

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-20-1e807e847848> in <module>()
----> 1 test= test.rename(columns={'ParcelId': 'parcelid'})
      2 #To make it easier for merging datasets on same column_id later

NameError: name 'test' is not defined
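
The NameError above means that `test` was never created before the rename. A sketch of the missing step, assuming `test` is meant to hold the competition's sample submission, whose key column is named ParcelId. The stand-in frame below replaces the file read, and the file name and prediction column are illustrative assumptions.

```python
import pandas as pd

# Stand-in for e.g. pd.read_csv("../input/sample_submission.csv") (assumed file name).
test = pd.DataFrame({"ParcelId": [10754147, 10759547], "201610": [0, 0]})

# Rename so the key matches the 'parcelid' column used in the other tables.
test = test.rename(columns={"ParcelId": "parcelid"})
print(test.columns.tolist())
```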